{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Predicting Survival on the Titanic\n",
    "\n",
    "### History\n",
    "Perhaps one of the most infamous shipwrecks in history, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 people on board. Interestingly, by analysing the probability of survival based on few attributes like gender, age, and social status, we can make very accurate predictions on which passengers would survive. Some groups of people were more likely to survive than others, such as women, children, and the upper-class. Therefore, we can learn about the society priorities and privileges at the time.\n",
    "\n",
    "### Assignment:\n",
    "\n",
    "Build a Machine Learning Pipeline, to engineer the features in the data set and predict who is more likely to Survive the catastrophe.\n",
    "\n",
    "Follow the Jupyter notebook below, and complete the missing bits of code, to achieve each one of the pipeline steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# to handle datasets\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "# for visualization\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# to divide train and test set\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# feature scaling\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "\n",
    "# to build the models\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "# to evaluate the models\n",
    "from sklearn.metrics import accuracy_score, roc_auc_score\n",
    "\n",
    "# to persist the model and the scaler\n",
    "import joblib\n",
    "\n",
    "# ========== NEW IMPORTS ========\n",
    "# Respect to notebook 02-Predicting-Survival-Titanic-Solution\n",
    "\n",
    "# pipeline\n",
    "\n",
    "\n",
    "# for the preprocessors\n",
    "\n",
    "\n",
    "# for imputation\n",
    "\n",
    "\n",
    "# for encoding categorical variables\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prepare the data set"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pclass</th>\n",
       "      <th>survived</th>\n",
       "      <th>name</th>\n",
       "      <th>sex</th>\n",
       "      <th>age</th>\n",
       "      <th>sibsp</th>\n",
       "      <th>parch</th>\n",
       "      <th>ticket</th>\n",
       "      <th>fare</th>\n",
       "      <th>cabin</th>\n",
       "      <th>embarked</th>\n",
       "      <th>boat</th>\n",
       "      <th>body</th>\n",
       "      <th>home.dest</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Allen, Miss. Elisabeth Walton</td>\n",
       "      <td>female</td>\n",
       "      <td>29</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>24160</td>\n",
       "      <td>211.3375</td>\n",
       "      <td>B5</td>\n",
       "      <td>S</td>\n",
       "      <td>2</td>\n",
       "      <td>?</td>\n",
       "      <td>St Louis, MO</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Allison, Master. Hudson Trevor</td>\n",
       "      <td>male</td>\n",
       "      <td>0.9167</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>113781</td>\n",
       "      <td>151.55</td>\n",
       "      <td>C22 C26</td>\n",
       "      <td>S</td>\n",
       "      <td>11</td>\n",
       "      <td>?</td>\n",
       "      <td>Montreal, PQ / Chesterville, ON</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Allison, Miss. Helen Loraine</td>\n",
       "      <td>female</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>113781</td>\n",
       "      <td>151.55</td>\n",
       "      <td>C22 C26</td>\n",
       "      <td>S</td>\n",
       "      <td>?</td>\n",
       "      <td>?</td>\n",
       "      <td>Montreal, PQ / Chesterville, ON</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Allison, Mr. Hudson Joshua Creighton</td>\n",
       "      <td>male</td>\n",
       "      <td>30</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>113781</td>\n",
       "      <td>151.55</td>\n",
       "      <td>C22 C26</td>\n",
       "      <td>S</td>\n",
       "      <td>?</td>\n",
       "      <td>135</td>\n",
       "      <td>Montreal, PQ / Chesterville, ON</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>Allison, Mrs. Hudson J C (Bessie Waldo Daniels)</td>\n",
       "      <td>female</td>\n",
       "      <td>25</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>113781</td>\n",
       "      <td>151.55</td>\n",
       "      <td>C22 C26</td>\n",
       "      <td>S</td>\n",
       "      <td>?</td>\n",
       "      <td>?</td>\n",
       "      <td>Montreal, PQ / Chesterville, ON</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pclass  survived                                             name     sex  \\\n",
       "0       1         1                    Allen, Miss. Elisabeth Walton  female   \n",
       "1       1         1                   Allison, Master. Hudson Trevor    male   \n",
       "2       1         0                     Allison, Miss. Helen Loraine  female   \n",
       "3       1         0             Allison, Mr. Hudson Joshua Creighton    male   \n",
       "4       1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female   \n",
       "\n",
       "      age  sibsp  parch  ticket      fare    cabin embarked boat body  \\\n",
       "0      29      0      0   24160  211.3375       B5        S    2    ?   \n",
       "1  0.9167      1      2  113781    151.55  C22 C26        S   11    ?   \n",
       "2       2      1      2  113781    151.55  C22 C26        S    ?    ?   \n",
       "3      30      1      2  113781    151.55  C22 C26        S    ?  135   \n",
       "4      25      1      2  113781    151.55  C22 C26        S    ?    ?   \n",
       "\n",
       "                         home.dest  \n",
       "0                     St Louis, MO  \n",
       "1  Montreal, PQ / Chesterville, ON  \n",
       "2  Montreal, PQ / Chesterville, ON  \n",
       "3  Montreal, PQ / Chesterville, ON  \n",
       "4  Montreal, PQ / Chesterville, ON  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# load the data - it is available open source and online\n",
    "\n",
    "data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')\n",
    "\n",
    "# display data\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# replace interrogation marks by NaN values\n",
    "\n",
    "data = data.replace('?', np.nan)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "# retain only the first cabin if more than\n",
    "# 1 are available per passenger\n",
    "\n",
    "def get_first_cabin(row):\n",
    "    try:\n",
    "        return row.split()[0]\n",
    "    except:\n",
    "        return np.nan\n",
    "    \n",
    "data['cabin'] = data['cabin'].apply(get_first_cabin)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "# extracts the title (Mr, Ms, etc) from the name variable\n",
    "\n",
    "def get_title(passenger):\n",
    "    line = passenger\n",
    "    if re.search('Mrs', line):\n",
    "        return 'Mrs'\n",
    "    elif re.search('Mr', line):\n",
    "        return 'Mr'\n",
    "    elif re.search('Miss', line):\n",
    "        return 'Miss'\n",
    "    elif re.search('Master', line):\n",
    "        return 'Master'\n",
    "    else:\n",
    "        return 'Other'\n",
    "    \n",
    "data['title'] = data['name'].apply(get_title)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# cast numerical variables as floats\n",
    "\n",
    "data['fare'] = data['fare'].astype('float')\n",
    "data['age'] = data['age'].astype('float')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>pclass</th>\n",
       "      <th>survived</th>\n",
       "      <th>sex</th>\n",
       "      <th>age</th>\n",
       "      <th>sibsp</th>\n",
       "      <th>parch</th>\n",
       "      <th>fare</th>\n",
       "      <th>cabin</th>\n",
       "      <th>embarked</th>\n",
       "      <th>title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>female</td>\n",
       "      <td>29.0000</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>211.3375</td>\n",
       "      <td>B5</td>\n",
       "      <td>S</td>\n",
       "      <td>Miss</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>male</td>\n",
       "      <td>0.9167</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>C22</td>\n",
       "      <td>S</td>\n",
       "      <td>Master</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>female</td>\n",
       "      <td>2.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>C22</td>\n",
       "      <td>S</td>\n",
       "      <td>Miss</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>male</td>\n",
       "      <td>30.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>C22</td>\n",
       "      <td>S</td>\n",
       "      <td>Mr</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>female</td>\n",
       "      <td>25.0000</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>151.5500</td>\n",
       "      <td>C22</td>\n",
       "      <td>S</td>\n",
       "      <td>Mrs</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   pclass  survived     sex      age  sibsp  parch      fare cabin embarked  \\\n",
       "0       1         1  female  29.0000      0      0  211.3375    B5        S   \n",
       "1       1         1    male   0.9167      1      2  151.5500   C22        S   \n",
       "2       1         0  female   2.0000      1      2  151.5500   C22        S   \n",
       "3       1         0    male  30.0000      1      2  151.5500   C22        S   \n",
       "4       1         0  female  25.0000      1      2  151.5500   C22        S   \n",
       "\n",
       "    title  \n",
       "0    Miss  \n",
       "1  Master  \n",
       "2    Miss  \n",
       "3      Mr  \n",
       "4     Mrs  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# drop unnecessary variables\n",
    "\n",
    "data.drop(labels=['name','ticket', 'boat', 'body','home.dest'], axis=1, inplace=True)\n",
    "\n",
    "# display data\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# # save the data set\n",
    "\n",
    "# data.to_csv('titanic.csv', index=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Begin Assignment\n",
    "\n",
    "## Configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# list of variables to be used in the pipeline's transformers\n",
    "\n",
    "NUMERICAL_VARIABLES = \n",
    "\n",
    "CATEGORICAL_VARIABLES = \n",
    "\n",
    "CABIN = "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Separate data into train and test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "((1047, 9), (262, 9))"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    data.drop('survived', axis=1),  # predictors\n",
    "    data['survived'],  # target\n",
    "    test_size=0.2,  # percentage of obs in test set\n",
    "    random_state=0)  # seed to ensure reproducibility\n",
    "\n",
    "X_train.shape, X_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preprocessors\n",
    "\n",
    "### Class to extract the letter from the variable Cabin"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "class ExtractLetterTransformer():\n",
    "    # Extract fist letter of variable\n",
    "\n",
    "    def __init__():\n",
    "        \n",
    "\n",
    "\n",
    "    def fit():\n",
    "\n",
    "        \n",
    "\n",
    "    def transform():\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pipeline\n",
    "\n",
    "- Impute categorical variables with string missing\n",
    "- Add a binary missing indicator to numerical variables with missing data\n",
    "- Fill NA in original numerical variable with the median\n",
    "- Extract first letter from cabin\n",
    "- Group rare Categories\n",
    "- Perform One hot encoding\n",
    "- Scale features with standard scaler\n",
    "- Fit a Logistic regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# set up the pipeline\n",
    "titanic_pipe = Pipeline([\n",
    "\n",
    "    # ===== IMPUTATION =====\n",
    "    # impute categorical variables with string 'missing'\n",
    "    ('categorical_imputation', ),\n",
    "\n",
    "    # add missing indicator to numerical variables\n",
    "    ('missing_indicator', ),\n",
    "\n",
    "    # impute numerical variables with the median\n",
    "    ('median_imputation', ),\n",
    "\n",
    "\n",
    "    # Extract first letter from cabin\n",
    "    ('extract_letter', ),\n",
    "\n",
    "\n",
    "    # == CATEGORICAL ENCODING ======\n",
    "    # remove categories present in less than 5% of the observations (0.05)\n",
    "    # group them in one category called 'Rare'\n",
    "    ('rare_label_encoder', ),\n",
    "\n",
    "\n",
    "    # encode categorical variables using one hot encoding into k-1 variables\n",
    "    ('categorical_encoder', ),\n",
    "\n",
    "    # scale using standardization\n",
    "    ('scaler', ),\n",
    "\n",
    "    # logistic regression (use C=0.0005 and random_state=0)\n",
    "    ('Logit', ),\n",
    "])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Pipeline(steps=[('categorical_imputation',\n",
       "                 CategoricalImputer(variables=['sex', 'cabin', 'embarked',\n",
       "                                               'title'])),\n",
       "                ('missing_indicator',\n",
       "                 AddMissingIndicator(variables=['age', 'fare'])),\n",
       "                ('median_imputation',\n",
       "                 MeanMedianImputer(variables=['age', 'fare'])),\n",
       "                ('extract_letter',\n",
       "                 ExtractLetterTransformer(variables=['cabin'])),\n",
       "                ('rare_label_encoder',\n",
       "                 RareLabelEncoder(n_categories=1,\n",
       "                                  variables=['sex', 'cabin', 'embarked',\n",
       "                                             'title'])),\n",
       "                ('categorical_encoder',\n",
       "                 OneHotEncoder(drop_last=True,\n",
       "                               variables=['sex', 'cabin', 'embarked',\n",
       "                                          'title'])),\n",
       "                ('scaler', StandardScaler()),\n",
       "                ('Logit', LogisticRegression(C=0.0005, random_state=0))])"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# train the pipeline\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Make predictions and evaluate model performance\n",
    "\n",
    "Determine:\n",
    "- roc-auc\n",
    "- accuracy\n",
    "\n",
    "**Important, remember that to determine the accuracy, you need the outcome 0, 1, referring to survived or not. But to determine the roc-auc you need the probability of survival.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "train roc-auc: 0.8450386398763523\n",
      "train accuracy: 0.7220630372492837\n",
      "\n",
      "test roc-auc: 0.8354629629629629\n",
      "test accuracy: 0.7137404580152672\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# make predictions for train set\n",
    "class_ = \n",
    "pred = \n",
    "\n",
    "# determine mse and rmse\n",
    "print('train roc-auc: {}'.format(roc_auc_score(y_train, pred)))\n",
    "print('train accuracy: {}'.format(accuracy_score(y_train, class_)))\n",
    "print()\n",
    "\n",
    "# make predictions for test set\n",
    "class_ = \n",
    "pred = \n",
    "\n",
    "# determine mse and rmse\n",
    "print('test roc-auc: {}'.format(roc_auc_score(y_test, pred)))\n",
    "print('test accuracy: {}'.format(accuracy_score(y_test, class_)))\n",
    "print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's it! Well done\n",
    "\n",
    "**Keep this code safe, as we will use this notebook later on, to build production code, in our next assignement!!**"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "feml",
   "language": "python",
   "name": "feml"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.2"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
